Abstract

Choosing  features  for  the  critic  in  actor-critic  algorithms  with  function  approximation  is  known  to  be  a  challenge. 
Too few critic features can lead to degeneracy of the actor gradient, and too many features may lead to slower convergence
of the learner. In this paper, we show that a well-studied class of actor policies satisfies the known requirements
for convergence when the actor features are selected carefully. We demonstrate that two popular representations for
value methods, the barycentric interpolators and the graph Laplacian proto-value functions, can be used to represent
the actor in order to satisfy these conditions. A consequence of this work is a generalization of the proto-value
function methods to the continuous action actor-critic domain. Finally, we analyze the performance of this approach
using  a  simulation  of  a  torque-limited  inverted  pendulum. 


1  Introduction 

Actor-Critic  (AC)  algorithms,  initially  proposed  by  Barto  et  al.  (1983),  aim  at  combining  the  strong  elements  of  the  two 
major  classes  of  reinforcement  learning  algorithms  -  namely  the  value-based  methods  and  the  policy  search  methods. 
As  in  value-based  methods,  the  critic  component  maintains  a  value  function,  and  as  in  policy  search  methods,  the  actor 
component  maintains  a  separate  parameterized  stochastic  policy  from  which  the  actions  are  drawn.  This  combination 
may  offer  the  convergence  guarantees  which  are  characteristic  of  the  policy  gradient  algorithms  as  well  as  an  improved 
convergence rate because the critic can be used to reduce the variance of the policy update (Konda and Tsitsiklis, 2003).

Recent  AC  algorithms  use  a  function  approximation  architecture  to  maintain  both  the  actor  policy  and  the  critic 
(state-action)  value  function,  relying  on  Temporal  Difference  (TD)  learning  methods  to  update  the  critic  parameters. 
Konda and Tsitsiklis (2000) and Sutton et al. (2000) showed that in order to compute the gradient of the performance
function (typically using the average cost criterion) with respect to the parameters of a stochastic policy $\mu_\theta(u|x)$,
it suffices to compute the projection of the state-action value function onto a subspace $\Psi_\theta$ spanned by the vectors
$\psi^i_\theta(x,u) = \frac{\partial}{\partial\theta_i} \ln \mu_\theta(u|x)$. Konda and Tsitsiklis (2003) also noted that for certain values of the policy parameters $\theta$,
it is possible that the vectors $\psi^i_\theta$ are either close to zero, or almost linearly dependent. In these situations the projection
onto $\Psi_\theta$ becomes ill-conditioned, providing no useful gradient information, and the algorithm can become unstable.
As a remedy for this problem the authors suggested the use of a richer, higher dimensional set of critic features which
contain the space $\Psi_\theta$ as a proper subset.

In this paper, we attempt to design features which span $\Psi_\theta$ and preserve linear independence without increasing the
dimensionality of the critic. In particular, we investigate stochastic actor policies represented by a family of Gaussian
distributions where the mean of the distribution is linearly parameterized using a set of fixed basis functions. For
this parameterization, we show that if the basis functions in the actor are selected to be linearly independent, then
the minimal set of critic features which naturally satisfy the containment condition also form a linearly independent
basis set. Additionally, if the actor basis set is linearly independent of the constant function $\mathbf{1}$, then the critic features satisfy a
weak version of the non-zero projection condition specified by Konda and Tsitsiklis (2003). This suggests that feature
sets which have been proposed for representing value functions, such as the proto-value functions (Mahadevan and
Maggioni, 2006), may also have promise as features for actor-critic algorithms. This extends the proto-value function
approach, which traditionally works by discretizing the action space, to a continuous action actor-critic domain.

The rest of this paper is organized as follows: in Section 2 we provide a brief review of the AC algorithms with
function  approximation.  In  Section  3  we  present  our  main  theoretical  results  by  investigating  a  family  of  parameterized 
Gaussian  policies.  In  Section  4  we  consider  candidate  features  which  satisfy  these  results.  In  Section  5  we  describe  an 
empirical  evaluation  of  our  approach  in  a  simulated  control  domain.  Finally,  we  discuss  some  implications  and  future 
directions  in  Section  6  and  conclude  in  Section  7. 


2  Preliminaries 


In this section we present a brief overview of the AC algorithms with function approximation adapted from Konda and
Tsitsiklis (2003). Assume that the problem is modeled as a Markov decision process $M = (\mathcal{X}, \mathcal{U}, P, c)$, where $\mathcal{X}$ is
the state space, $\mathcal{U}$ is the action space, $P(x'|x,u)$ is the transition probability function, $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ is the one-step
cost function, and $\mu_\theta$ is a stochastic policy parameterized by the vector $\theta \in \mathbb{R}^n$, where $\mu_\theta(u|x)$ gives the probability of selecting
an action $u$ in state $x$. We also assume that for every $\theta \in \mathbb{R}^n$, the Markov chains
$\{X_k\}$ and $\{X_k, U_k\}$ are irreducible and aperiodic, with stationary probabilities $\pi_\theta(x)$ and $\eta_\theta(x,u) = \pi_\theta(x)\mu_\theta(u|x)$
respectively.

The average cost function $\bar{\alpha}(\theta) : \mathbb{R}^n \to \mathbb{R}$ can be defined as:

$$\bar{\alpha}(\theta) = \sum_{x \in \mathcal{X},\, u \in \mathcal{U}} c(x,u)\, \eta_\theta(x,u)$$
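
To make the average cost concrete, the following is a minimal sketch for a small finite MDP. The transition tensor, costs, and policy are randomly generated stand-ins (hypothetical, not taken from the paper); the code computes the stationary distribution of the state chain, forms the state-action stationary probabilities as the product of the state probabilities and the policy, and evaluates the average cost as the cost averaged under those probabilities.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution of an irreducible, aperiodic chain
    (left eigenvector of P_pi for eigenvalue 1, normalized to sum to 1)."""
    evals, evecs = np.linalg.eig(P_pi.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def average_cost(P, c, mu):
    """P[x, u, y] = P(y | x, u), c[x, u] = one-step cost, mu[x, u] = mu(u | x)."""
    P_pi = np.einsum("xu,xuy->xy", mu, P)   # state transition matrix under the policy
    pi = stationary_distribution(P_pi)      # pi(x)
    eta = pi[:, None] * mu                  # eta(x, u) = pi(x) * mu(u | x)
    return float(np.sum(c * eta)), eta

# Hypothetical random problem: 4 states, 3 actions
rng = np.random.default_rng(0)
nx, nu = 4, 3
P = rng.random((nx, nu, nx))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nx, nu))
mu = rng.random((nx, nu))
mu /= mu.sum(axis=1, keepdims=True)

alpha_bar, eta = average_cost(P, c, mu)
print("average cost:", alpha_bar)
```

Because the average cost is an expectation of the one-step cost under a probability distribution, it always lies between the smallest and largest one-step cost.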

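For each $\theta \in \mathbb{R}^n$, let $V_\theta : \mathcal{X} \to \mathbb{R}$ and $Q_\theta : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ be the differential state and the differential state-action cost
functions that are solutions to the corresponding Poisson equations in a standard average cost setting. Then, following
the results of Marbach and Tsitsiklis (1998), the gradient of the average cost function can be expressed as:

$$\nabla_\theta \bar{\alpha}(\theta) = \sum_{x,u} \eta_\theta(x,u)\, Q_\theta(x,u)\, \psi_\theta(x,u) \qquad (1)$$

where:

$$\psi_\theta(x,u) = \nabla_\theta \ln \mu_\theta(u|x) \qquad (2)$$

The $i$th component of $\psi_\theta$, $\psi^i_\theta(x,u)$, is the one-step eligibility of parameter $i$ in state-action pair $x$ and $u$:

$$\psi^i_\theta(x,u) = \frac{\partial}{\partial\theta_i} \ln \mu_\theta(u|x). \qquad (3)$$
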


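Following Konda and Tsitsiklis (2003), we will assume that $\mathcal{X}$ and $\mathcal{U}$ are discrete, countably infinite sets. We will
therefore refer to $\psi^i_\theta$ as the actor eligibility vector, a vector in $\mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$. For any $\theta \in \mathbb{R}^n$, the inner product $\langle \cdot, \cdot \rangle_\theta$ of two
real-valued functions $Q_1$, $Q_2$ on $\mathcal{X} \times \mathcal{U}$, also viewed as vectors in $\mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$, can be defined by:

$$\langle Q_1, Q_2 \rangle_\theta = \sum_{x,u} \eta_\theta(x,u)\, Q_1(x,u)\, Q_2(x,u)$$

and let $\|\cdot\|_\theta$ denote the norm induced by this inner product on $\mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$. Now, we can rewrite Equation 1 as:

$$\frac{\partial}{\partial\theta_i} \bar{\alpha}(\theta) = \langle Q_\theta, \psi^i_\theta \rangle_\theta, \quad i = 1, \ldots, n.$$

For each $\theta \in \mathbb{R}^n$, let $\Psi_\theta$ denote the span of the vectors $\{\psi^i_\theta;\ 1 \le i \le n\}$ in $\mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$. An important observation is that
although the gradient of $\bar{\alpha}$ depends on the function $Q_\theta$, which is a vector in a possibly very high-dimensional space
$\mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$, the dependence is only through its inner products with vectors in $\Psi_\theta$. Thus, instead of "learning" the function
$Q_\theta$, it suffices to learn its projection on the low-dimensional subspace $\Psi_\theta$.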
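
The observation that only the projection matters can be checked numerically. The sketch below is illustrative only: the weights `eta`, the eligibility vectors `psi`, and the value vector `Q` are random stand-ins (assumptions, not quantities from the paper). It projects `Q` onto the span of the eligibility vectors under the weighted inner product and confirms that the gradient components computed from the projection match those computed from the full vector.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 50, 3                        # N state-action pairs, n actor parameters
eta = rng.random(N)
eta /= eta.sum()                    # stand-in for the stationary probabilities
psi = rng.standard_normal((N, n))   # columns: stand-ins for the eligibility vectors
Q = rng.standard_normal(N)          # stand-in for the state-action cost function

def inner(a, b):
    """Weighted inner product <a, b> = sum_i eta_i * a_i * b_i."""
    return float(np.sum(eta * a * b))

# Orthogonal projection of Q onto span(psi) with respect to the weighted inner product
gram = psi.T @ (eta[:, None] * psi)
rhs = psi.T @ (eta * Q)
Q_proj = psi @ np.linalg.solve(gram, rhs)

grad_full = np.array([inner(Q, psi[:, i]) for i in range(n)])
grad_proj = np.array([inner(Q_proj, psi[:, i]) for i in range(n)])
print(grad_full, grad_proj)
```

The two gradient vectors agree even though `Q_proj` lives in a 3-dimensional subspace and `Q` does not, which is the reason a low-dimensional critic can suffice.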

Konda and Tsitsiklis (2003) consider actor-critic algorithms where the critic is a TD algorithm with a linearly
parameterized approximation architecture for the Q-value function that admits the linear-additive form:

$$Q^r_\theta(x,u) = \sum_{j=1}^{m} r^j \phi^j_\theta(x,u) \qquad (4)$$

where $r = (r^1, \ldots, r^m) \in \mathbb{R}^m$ is the parameter vector of the critic. The critic features $\phi^j_\theta$, $j = 1, \ldots, m$, depend on the
actor parameter vector and are chosen so that the following assumptions are satisfied: (1) For every $(x,u) \in \mathcal{X} \times \mathcal{U}$,
the map $\theta \mapsto \phi^j_\theta(x,u)$ is bounded and differentiable; (2) The span of the vectors $\phi^j_\theta$ ($j = 1, \ldots, m$) in $\mathbb{R}^{|\mathcal{X}||\mathcal{U}|}$, denoted
by $\Phi_\theta$, contains $\Psi_\theta$.

As noted by Konda and Tsitsiklis (2003), one trivial choice for satisfying the second condition would be to set
$\Psi_\theta = \Phi_\theta$, or in other words to set the critic features as $\phi^i_\theta = \psi^i_\theta$. However, it is possible that for some values of $\theta$, the
features $\psi^i_\theta$ are either close to zero, or almost linearly dependent. In these situations the projection of $Q_\theta$ onto $\Psi_\theta$
becomes ill-conditioned, providing no useful gradient information, and therefore the algorithm may become unstable.
Konda and Tsitsiklis (2003) suggest some ideas to remedy this problem. In particular, the troublesome situations
are avoided if the following condition is satisfied: (3) There exists $a > 0$, such that for every $r \in \mathbb{R}^m$ and $\theta \in \mathbb{R}^n$:

$$\| r' \hat{\phi}_\theta \|^2_\theta \ge a |r|^2$$

where $\hat{\phi}_\theta = \{\hat{\phi}^i_\theta\}_{i=1}^{m}$ are defined as:

$$\hat{\phi}^i_\theta(x,u) = \phi^i_\theta(x,u) - \sum_{\bar{x},\bar{u}} \eta_\theta(\bar{x},\bar{u})\, \phi^i_\theta(\bar{x},\bar{u}) \qquad (5)$$


This condition can be roughly explained as follows: the new functions $\hat{\phi}_\theta$ can be viewed as the original critic features
with their expected value (with respect to the distribution $\eta_\theta(x,u)$) removed. In order to ensure that the projection of
$Q_\theta$ onto $\Psi_\theta$ contains some gradient information for the actor (and to avoid instability), the set $\hat{\phi}_\theta$ must be uniformly
bounded away from zero. Given these conditions, Konda and Tsitsiklis (2003) prove convergence of the most
common form of the actor-critic update (see Konda and Tsitsiklis (2003, p. 1148) for the updates).
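
One way to read condition (3) numerically: for fixed policy parameters, the best constant in the inequality is the smallest eigenvalue of the weighted Gram matrix of the mean-removed features. The sketch below uses random stand-in features and weights (assumptions for illustration, not from the paper) and shows that appending the constant function to the feature set drives this constant to zero, because the constant function vanishes once its mean is removed.

```python
import numpy as np

def nonzero_projection_constant(phi, eta):
    """Smallest eigenvalue of the eta-weighted Gram matrix of the mean-removed
    features, i.e. the largest a with ||r' phi_hat||^2 >= a |r|^2 for all r."""
    phi_hat = phi - eta @ phi                   # remove each feature's eta-weighted mean
    gram = phi_hat.T @ (eta[:, None] * phi_hat)
    return float(np.linalg.eigvalsh(gram)[0])   # eigvalsh returns ascending eigenvalues

rng = np.random.default_rng(2)
N, m = 40, 3
eta = rng.random(N)
eta /= eta.sum()                                # stand-in stationary probabilities
phi = rng.standard_normal((N, m))               # stand-in critic features

a_good = nonzero_projection_constant(phi, eta)

# Adding the constant function makes the mean-removed feature set degenerate
phi_with_one = np.hstack([phi, np.ones((N, 1))])
a_bad = nonzero_projection_constant(phi_with_one, eta)
print(a_good, a_bad)
```

This is why linear independence from the constant function appears alongside linear independence of the features themselves in the conditions discussed in this paper.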

Konda  and  Tsitsiklis  (2003)  go  on  to  propose  adding  additional  features  to  the  critic  as  a  remedy,  but  satisfying 
this  condition  is  still  a  difficult  problem.  To  the  best  of  our  knowledge  there  is  no  general  systematic  approach  for 
choosing  a  set  of  critic  features  that  satisfies  this  third  condition.  In  the  next  section,  we  will  address  this  issue  for  one 
commonly  used  policy  class. 

3  Our  Approach 

We consider the following popular Gaussian probabilistic policy structure parameterized by $\theta$:

$$\mu_\theta(u|x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (u - m_\theta(x))^T \Sigma^{-1} (u - m_\theta(x)) \right\} \qquad (6)$$

where $u \in \mathbb{R}^k$ is a multi-dimensional action vector. The vector $m_\theta(x) \in \mathbb{R}^k$ is the mean of the distribution that is
parameterized by $\theta$:

$$m^i_\theta(x) = \sum_{j=1}^{n} \theta_{ij}\, \rho^j(x), \quad i = 1, \ldots, k$$

where in this setting $\theta \in \mathbb{R}^{k \times n}$. The functions $\rho^j(x)$ are a set of actor features defined over the states. For simplicity,
in this paper we only investigate the case where $\Sigma = \sigma_0^2 I$. In this case Equation 6 simplifies to:

$$\mu_\theta(u|x) = \frac{1}{(2\pi)^{k/2} \sigma_0^k} \exp\left\{ -\frac{1}{2\sigma_0^2} (u - m_\theta(x))^T (u - m_\theta(x)) \right\}$$
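
Using Equation 3 we can compute the actor eligibility vectors as follows:

$$\psi^{ij}_\theta(x,u) = \frac{\partial}{\partial\theta_{ij}} \ln \mu_\theta(u|x)$$

$$= \frac{\partial}{\partial\theta_{ij}} \left\{ -\ln\left((2\pi)^{k/2} \sigma_0^k\right) - \frac{1}{2\sigma_0^2} (u - m_\theta(x))^T (u - m_\theta(x)) \right\}$$

$$= \frac{1}{\sigma_0^2} (u - m_\theta(x))^T \frac{\partial}{\partial\theta_{ij}} m_\theta(x)$$

$$= \frac{1}{\sigma_0^2} (u_i - m^i_\theta(x))\, \rho^j(x)$$

$$= K^i_\theta(x,u)\, \rho^j(x) \qquad (7)$$
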

where $K^i_\theta(x,u) = \frac{u_i - m^i_\theta(x)}{\sigma_0^2}$. In order to satisfy condition (2) in the previous section ($\Phi_\theta$ should contain
$\Psi_\theta$), we apply the straightforward solution of setting $\phi^{ij}_\theta = \psi^{ij}_\theta$ for $i = 1, \ldots, k$ and $j = 1, \ldots, n$. This selection
also guarantees that the mapping from $\theta$ to $\phi_\theta$ is bounded and differentiable, as required by condition (1). In Proposition 1, we
show that for the particular choice of policy structure that we have chosen, if the actor features, $\rho^j(x)$, are linearly
independent, then the critic features will also be linearly independent.
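
The closed form of the eligibility (the scaled deviation of the action from the policy mean, multiplied by the actor feature) can be sanity-checked against finite differences of the log-density. The following is a minimal sketch under stated assumptions: the function names, dimensions, and random inputs are hypothetical, and the covariance is the isotropic case considered in this section.

```python
import numpy as np

def log_policy(theta, rho_x, u, sigma0):
    """Log-density of a Gaussian policy with mean theta @ rho(x) and covariance sigma0^2 I."""
    k = u.shape[0]
    diff = u - theta @ rho_x
    return (-0.5 * k * np.log(2 * np.pi) - k * np.log(sigma0)
            - diff @ diff / (2 * sigma0**2))

def eligibility(theta, rho_x, u, sigma0):
    """Closed form: entry (i, j) is K_i(x, u) * rho_j(x), K_i = (u_i - m_i(x)) / sigma0^2."""
    K = (u - theta @ rho_x) / sigma0**2
    return np.outer(K, rho_x)

rng = np.random.default_rng(3)
k, n, sigma0 = 2, 4, 0.7
theta = rng.standard_normal((k, n))
rho_x = rng.standard_normal(n)      # actor features evaluated at one hypothetical state
u = rng.standard_normal(k)

psi = eligibility(theta, rho_x, u, sigma0)

# Central finite differences of the log-density with respect to each theta_ij
eps = 1e-6
psi_fd = np.zeros_like(theta)
for i in range(k):
    for j in range(n):
        d = np.zeros_like(theta)
        d[i, j] = eps
        psi_fd[i, j] = (log_policy(theta + d, rho_x, u, sigma0)
                        - log_policy(theta - d, rho_x, u, sigma0)) / (2 * eps)

print(np.max(np.abs(psi - psi_fd)))
```

Each critic feature in this construction is therefore a product of an action-dependent factor and a state-dependent actor feature, which is the structure the propositions below exploit.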


Proposition 1: If the functions $\rho = \{\rho^j\}_{j=1}^{n}$ are linearly independent, then the set of critic feature functions $\phi^{ij}_\theta$
will form a linearly independent set of functions.


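Proof: We prove by contradiction that if the above condition holds, then the set of actor eligibility functions $\psi^{ij}_\theta$
(and therefore also $\phi^{ij}_\theta$) are linearly independent. Assume that the functions $\psi^{ij}_\theta$ are linearly dependent. Then there
exists $\alpha = \{\alpha_{ij} \in \mathbb{R}\}_{i=1,j=1}^{k,n}$ such that

$$\sum_{i=1,j=1}^{k,n} \alpha_{ij}\, \psi^{ij}_\theta(x,u) = 0, \quad \forall x \in \mathcal{X},\ \forall u \in \mathcal{U},$$

and $\|\alpha\|_2 > 0$. Substituting the right hand side of Equation 7 for $\psi^{ij}_\theta(x,u)$ yields:

$$\sum_{i=1,j=1}^{k,n} \alpha_{ij}\, K^i_\theta(x,u)\, \rho^j(x) = 0, \quad \forall x \in \mathcal{X},\ \forall u \in \mathcal{U}.$$

By regrouping terms we obtain:

$$\sum_{j=1}^{n} \left( \sum_{i=1}^{k} \alpha_{ij}\, K^i_\theta(x,u) \right) \rho^j(x) = 0, \quad \forall x \in \mathcal{X},\ \forall u \in \mathcal{U}.$$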


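Since according to the assumption the functions $\rho = \{\rho^j\}_{j=1}^{n}$ are linearly independent, the following condition
must hold:

$$\sum_{i=1}^{k} \alpha_{ij}\, K^i_\theta(x,u) = 0, \quad \forall j = 1, \ldots, n,\ \forall x \in \mathcal{X},\ \forall u \in \mathcal{U} \qquad (8)$$

But for every $i$, there exists an $(x,u)$ such that $K^i_\theta(x,u) \neq 0$:

$$K^i_\theta(x, m_\theta(x) + \epsilon_i \mathbf{1}) = \frac{\epsilon_i}{\sigma_0^2}, \qquad (9)$$

where $\mathbf{1}$ is the $k \times 1$ vector of ones, and $\epsilon_i \neq 0$. Note that the above condition holds for all $\epsilon_i \in \mathbb{R} \setminus \{0\}$. Now, define
a $(k \times 1)$ vector $h_l$ (for $l = 1, \ldots, k$) as:

$$h_l(j) = \begin{cases} \epsilon & \text{if } j \neq l \\ 2\epsilon & \text{if } j = l \end{cases} \qquad (10)$$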
for some $\epsilon > 0$. Based on Equation 9, if we choose $u = m_\theta(x) + h_l$ in Equation 8, we obtain:

$$\frac{1}{\sigma_0^2} h_l^T \alpha_j = 0, \quad \forall l = 1, \ldots, k$$

where $\alpha_j = [\alpha_{1j}, \alpha_{2j}, \ldots, \alpha_{kj}]^T$. This gives us the following system of equations (for a fixed value of $j$):

$$A \alpha_j = 0 \qquad (11)$$

where $A_{k \times k} = [h_1\ h_2\ \ldots\ h_k]^T$. Note that for the particular choice of the vectors $h_l$ (Equation 10), the matrix $A$ has
full rank (since the vectors $h_l$ are linearly independent), and thus the only solution to Equation 11 is $\alpha_j = 0$.
This means that $\alpha_{ij} = 0$ (for all $i, j$), and thus $\|\alpha\|_2 = 0$. By contradiction, $\psi^{ij}_\theta$ (and therefore $\phi^{ij}_\theta$) must be linearly
independent.
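
The full-rank claim for the matrix built from the perturbation vectors in Equation 10 is easy to verify numerically. The sketch below (dimension and perturbation size are arbitrary illustrative choices) constructs the matrix, which equals the perturbation size times the sum of the all-ones matrix and the identity, and checks its rank and eigenvalues.

```python
import numpy as np

def perturbation_matrix(k, eps):
    """Matrix whose l-th row is h_l^T, with h_l(j) = eps for j != l and 2*eps for j == l."""
    return eps * (np.ones((k, k)) + np.eye(k))

k, eps = 5, 0.1                     # illustrative choices
A = perturbation_matrix(k, eps)

rank = np.linalg.matrix_rank(A)
eigs = np.linalg.eigvalsh(A)        # ascending: eps (k - 1 times), then eps * (k + 1)
alpha_j = np.linalg.solve(A, np.zeros(k))   # unique solution of A alpha_j = 0

print(rank, eigs, alpha_j)
```

Since the all-ones matrix has eigenvalues 0 and k, every eigenvalue of the constructed matrix is strictly positive, so the zero vector is the only solution of the homogeneous system.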

Proposition 1 provides a mechanism for ensuring that the $\theta$-dependent critic features remain linearly independent
for all $\theta$, thereby avoiding a major source of potential instabilities in the AC algorithm. However, to meet the strict
conditions from Konda and Tsitsiklis (2003), we should also demonstrate that the critic features are uniformly bounded
away from zero. Proposition 2 allows us to demonstrate that a set of actor features that is also linearly independent
of the function $\mathbf{1}$ satisfies the weak form of condition (3).

Proposition 2: If the functions $\mathbf{1}$ and $\rho = \{\rho^j\}$, $j = 1, \ldots, n$, are linearly independent, then the set of critic
feature functions $\phi^{ij}_\theta$ and the function $\mathbf{1}$ will also form a linearly independent set of functions.


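Proof (sketch): We follow the proof of Proposition 1. Assume that the functions $\psi^{ij}_\theta$ and $\mathbf{1}$ are linearly dependent.
Then there exists $\alpha = \{\alpha_{ij} \in \mathbb{R}\}_{i=1,j=1}^{k,n} \cup \{\alpha_1 \in \mathbb{R}\}$ such that:

$$\sum_{i=1,j=1}^{k,n} \alpha_{ij}\, \psi^{ij}_\theta(x,u) + \alpha_1 \mathbf{1} = 0, \quad \forall x \in \mathcal{X},\ \forall u \in \mathcal{U},$$

and $\|\alpha\|_2 > 0$. Following the same steps as in the proof of Proposition 1, we obtain:

$$\sum_{j=1}^{n} \left( \sum_{i=1}^{k} \alpha_{ij}\, K^i_\theta(x,u) \right) \rho^j(x) + \alpha_1 \mathbf{1} = 0, \quad \forall x \in \mathcal{X},\ \forall u \in \mathcal{U}. \qquad (12)$$

Since according to the assumption the functions $\rho = \{\rho^j\}_{j=1}^{n}$ and $\mathbf{1}$ are linearly independent, $\alpha_1 = 0$.
Following the rest of the steps in the proof of Proposition 1, it can also be established that $\alpha_{ij} = 0$ (for all $i, j$). This
completes the proof.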

Konda and Tsitsiklis (2003) prove that if the function $\mathbf{1}$ and the critic features $\phi^j_\theta$ are linearly independent for
each $\theta$, then there exists a positive function $a(\theta)$ such that:

$$\| r' \hat{\phi}_\theta \|^2_\theta \ge a(\theta) |r|^2 \qquad (13)$$

(refer to Section 2, Equation 5 for the definition of $\hat{\phi}_\theta$). This is the weak form of the non-zero projection property.

Finally, it should be noted that it is also possible to tune the standard deviation of the policy distribution, $\sigma_0$, as
a function of state using additional policy parameters, $w$. If we parameterize $\sigma_0(x) = [1 + \exp(-\sum_j w_j \rho^j(x))]^{-1}$,
then, for a scalar action, the eligibility of this actor parameter takes the form:

$$\frac{\partial}{\partial w_j} \ln \mu_{\theta,w}(u|x) = \frac{(u - m_\theta(x))^2 - \sigma_0^2(x)}{\sigma_0^2(x)} \left(1 - \sigma_0(x)\right) \rho^j(x) = K'_{\theta,w}(x,u)\, \rho^j(x).$$

It can be shown that this set of vectors forms a linearly independent basis set, which is also independent of the bases
$\psi^{ij}_\theta$.
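
As a check on the eligibility of the standard-deviation parameters, the sketch below differentiates the log-density of a scalar-action Gaussian policy whose standard deviation is a logistic function of the weighted actor features, and compares a closed form against finite differences. The closed form used here, ((u - m)^2 - sigma^2) / sigma^2 * (1 - sigma) * rho_j, is our own derivation by the chain rule and should be read as an assumption rather than the paper's exact expression; all names and inputs are hypothetical.

```python
import numpy as np

def sigma0(w, rho_x):
    """Logistic parameterization of the standard deviation: 1 / (1 + exp(-w . rho(x)))."""
    return 1.0 / (1.0 + np.exp(-w @ rho_x))

def log_policy(w, rho_x, u, m):
    """Log-density of a scalar Gaussian policy with mean m and standard deviation sigma0(x)."""
    s = sigma0(w, rho_x)
    return -0.5 * np.log(2 * np.pi) - np.log(s) - (u - m) ** 2 / (2 * s**2)

def sigma_eligibility(w, rho_x, u, m):
    """Chain rule: d ln(mu)/d sigma = ((u-m)^2 - s^2)/s^3 and d sigma/d w_j = s (1-s) rho_j."""
    s = sigma0(w, rho_x)
    return ((u - m) ** 2 - s**2) / s**2 * (1.0 - s) * rho_x

rng = np.random.default_rng(4)
n = 3
w = 0.3 * rng.standard_normal(n)
rho_x = rng.standard_normal(n)      # actor features at one hypothetical state
u, m = 0.8, 0.1                     # hypothetical action and policy mean

psi_w = sigma_eligibility(w, rho_x, u, m)

eps = 1e-6
psi_w_fd = np.array([
    (log_policy(w + eps * e, rho_x, u, m) - log_policy(w - eps * e, rho_x, u, m)) / (2 * eps)
    for e in np.eye(n)
])
print(psi_w, psi_w_fd)
```

As with the mean parameters, the result factors into an action-dependent term times the actor feature, so the same style of independence argument applies.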

4  Candidate  Features

In this section we investigate two different approaches for choosing linearly independent actor features, $\rho^j(x)$.


4.1  Barycentric  Interpolation

Barycentric interpolants, described in (Munos and Moore, 1998, 2002), are defined over an arbitrary set of (non-overlapping)
mesh points $\xi_i$ distributed across the state space. We denote the vector-valued output of the function
approximator at each mesh point as $m(\xi_i)$. For an arbitrary $x$, if we define a simplex $S(x) \subseteq \{\xi_1, \ldots, \xi_n\}$ such that $x$
is in the interior of the simplex, then the output at $x$ is given by interpolating the mesh points $\xi_i \in S(x)$:

$$m_\theta(x) = \sum_{\xi_i \in S(x)} m(\xi_i)\, \lambda_{\xi_i}(x).$$

Note that the interpolation is called barycentric if the positive coefficients $\lambda_{\xi_i}(x)$ sum to one, and if $x = \sum_{\xi_i \in S(x)} \xi_i \lambda_{\xi_i}(x)$
(Munos and Moore, 1998). In addition, Munos and Moore (1998) define the piecewise linear barycentric interpolation
functions as functions for which the interpolation uses exactly $\dim(x) + 1$ mesh points, such that the simplex for state $x$
belongs to a triangulation of the state space and does not contain any interior mesh points. Barycentric
interpolators are a popular representation for value functions, because they provide a natural mechanism for variable
resolution discretization of the value function, and the barycentric coordinates allow the interpolators to be used
directly by value iteration algorithms.

These interpolators also represent a linear function approximation architecture; we confirm here that the feature
vectors are linearly independent. Let us consider the output at the mesh points as the parameters, $\theta_i = m(\xi_i)$, and the
interpolation function as the features $\rho_i(x) = \lambda_{\xi_i}(x)$.

Proposition 3: The features $\rho_i(x)$ formed by the piecewise linear barycentric interpolation of a non-overlapping
mesh ($\xi_i \neq \xi_j$, $\forall i \neq j$) form a linearly independent basis set.

Proof (sketch): For a non-overlapping mesh, consider the solution for the barycentric weights of a piecewise
linear barycentric interpolation evaluated at $x = \xi_i$. There are multiple simplices $S(x)$ that contain $x$, but for each
such simplex, $x$ is a vertex of that simplex. By definition of a simplex, $x$ is linearly independent of all other vertices
of each simplex. As a result, the unique solution for the barycentric weights is $\rho_i(x) = 1$, $\rho_j(x) = 0$, $\forall j \neq i$. Since
for each feature we can find an $x$ at which only that feature is non-zero, the basis set must be linearly independent.

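Note that the traditional barycentric interpolators are not constrained to be linearly independent from the function
$\mathbf{1}$; indeed, because the barycentric weights sum to one at every state, the constant function lies in the span of the full feature set.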
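
The properties used above can be illustrated in one dimension, where piecewise linear barycentric interpolation reduces to the familiar "hat" functions. The following is a minimal sketch with a hypothetical mesh (not from the paper): it checks that the weights are non-negative, sum to one, reproduce the state, reduce to an indicator at the mesh points, and form a full-rank feature matrix, while the constant function stays inside their span.

```python
import numpy as np

def hat_features(x, mesh):
    """Piecewise linear barycentric weights of a scalar x on a sorted 1-D mesh."""
    lam = np.zeros(len(mesh))
    # Index of the left vertex of the cell containing x (last cell for x == mesh[-1])
    i = int(np.clip(np.searchsorted(mesh, x, side="right") - 1, 0, len(mesh) - 2))
    t = (x - mesh[i]) / (mesh[i + 1] - mesh[i])
    lam[i], lam[i + 1] = 1.0 - t, t
    return lam

mesh = np.array([0.0, 0.3, 1.0, 1.5, 2.0])           # hypothetical non-overlapping mesh
xs = np.linspace(mesh[0], mesh[-1], 41)
F = np.array([hat_features(x, mesh) for x in xs])    # rows: states, columns: features

at_mesh = np.array([hat_features(xi, mesh) for xi in mesh])

rank_F = np.linalg.matrix_rank(F)
rank_with_one = np.linalg.matrix_rank(np.column_stack([F, np.ones(len(xs))]))
print(rank_F, rank_with_one)
```

The feature matrix evaluated at the mesh points is the identity, which is the argument of Proposition 3; the unchanged rank after appending a column of ones reflects that these features are not independent of the constant function.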

4.2  Graph  Laplacian 

Proto-Value Functions (PVFs) (Mahadevan and Maggioni, 2006) have recently shown some success in automatic
learning of representations in the context of function approximation in MDPs. In this approach, the agents learn
global task-independent basis functions that reflect the large-scale geometry of the state-action space that all
task-specific value functions must adhere to. Such basis functions are learned based on the topological structure of graphs
representing the state (or state-action) space manifold. PVFs are essentially a subset of eigenfunctions of the graph
Laplacian computed from a random walk graph generated by the agent. We show here that if the proto-value functions
are used to represent features of the actor, instead of the critic, then this representation satisfies our Proposition 2.


Proposition 4: If the functions $\{\rho^j\}_{j=1}^{n}$ are the proto-value functions computed from the graph generated by a
random walk in state space, then the set of critic features and the function $\mathbf{1}$ will form a linearly independent basis set,
and will satisfy the weak form of the non-zero projection property presented in Equation 13.

Figure 1: Performance comparison between MCPG and AC algorithm. These results are averaged over 5 trials. Note
that  this  is  an  infinite-horizon  average  cost  setting;  the  entire  x-axis  represents  1000  seconds  of  simulated  data. 


Proof (sketch): Since the functions $\{\rho^j\}_{j=1}^{n}$ are eigenfunctions of the graph Laplacian computed from the
graph generated by a random walk in state space, they are linearly independent. Note that the function $\mathbf{1}$ is always
the eigenfunction of any graph Laplacian associated with the eigenvalue 0. That implies the functions $\{\rho^j\}_{j=1}^{n}$, taken from the remaining eigenfunctions, are
also linearly independent of the function $\mathbf{1}$. Based on the results of Proposition 2, the critic feature functions $\phi^{ij}_\theta$
are also linearly independent and also linearly independent of the function $\mathbf{1}$, and the features will satisfy the weak
non-zero projection property.
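
The argument can be illustrated on a small graph. The sketch below is an assumption-laden toy: it uses the combinatorial Laplacian (degree matrix minus adjacency matrix) of a hypothetical 8-node chain rather than a graph sampled from a real random walk. It confirms that the constant vector is the eigenvector for eigenvalue 0 and that the remaining eigenvectors, used here as actor features, are orthogonal to it and hence linearly independent of it.

```python
import numpy as np

def laplacian_eigenfunctions(W):
    """Eigen-decomposition of the combinatorial graph Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    evals, evecs = np.linalg.eigh(L)   # ascending eigenvalues, orthonormal eigenvectors
    return evals, evecs

# Hypothetical connected graph: a chain of 8 sampled states
N = 8
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

evals, evecs = laplacian_eigenfunctions(W)
constant = evecs[:, 0]                 # eigenvalue 0: proportional to the function 1
rho = evecs[:, 1:4]                    # actor features: next three eigenvectors, constant omitted

stacked = np.column_stack([np.ones(N), rho])
print(evals[:4], np.linalg.matrix_rank(stacked))
```

Note that a normalized Laplacian has a different zero-eigenvalue eigenvector (it is scaled by the square roots of the node degrees), so the independence from the constant function should be rechecked if that variant is used.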


5  Experiments 

We demonstrate the effectiveness of our feature selections by learning a control policy for the swing-up task on a
torque-limited inverted pendulum, governed by $\ddot{q} = \frac{1}{ml^2} [\tau - b\dot{q} - mgl \cos q]$, with $m = 1$, $l = 1$, $b = 1$, $g = 9.8$,
$|\tau| \le 3$, and initial conditions $q = -\frac{\pi}{2}$, $\dot{q} = 0$. We use an infinite-horizon, average-cost formulation (no
resetting) with an instantaneous cost function $g(q, \dot{q}, \tau)$ that is quadratic in the angle error $q - \frac{\pi}{2}$, the velocity $\dot{q}$, and the torque $\tau$. The policy is evaluated every
$dt = 0.1$ seconds; $\tau$ is held constant (zero-order hold) between evaluations.
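
A minimal simulation sketch of this setup follows. It assumes the pendulum dynamics have the form angular acceleration = (torque - b * velocity - m * g * l * cos(q)) / (m * l^2), with the parameter values stated above; the integrator (semi-implicit Euler with 100 sub-steps per control interval) and the function name are our own assumptions, since the paper does not specify them. The cost function is deliberately omitted.

```python
import numpy as np

# Parameters stated in the text; sub-stepping scheme below is an assumption
M, L, B, G = 1.0, 1.0, 1.0, 9.8
TAU_MAX, DT = 3.0, 0.1

def pendulum_step(q, qdot, tau, substeps=100):
    """Advance one control interval DT with the torque held constant (zero-order hold)."""
    tau = float(np.clip(tau, -TAU_MAX, TAU_MAX))   # torque limit
    h = DT / substeps
    for _ in range(substeps):
        qddot = (tau - B * qdot - M * G * L * np.cos(q)) / (M * L**2)
        qdot += h * qddot                          # semi-implicit Euler update
        q += h * qdot
    return q, qdot

# Example rollout: start hanging down at rest and apply the maximum torque for 1 second
q, qdot = -np.pi / 2, 0.0
for _ in range(10):
    q, qdot = pendulum_step(q, qdot, TAU_MAX)
print(q, qdot)
```

With these parameters the maximum torque (3) is smaller than the peak gravity torque (9.8), so the pendulum cannot be lifted directly to the upright position, which is what makes the swing-up task non-trivial.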

Samples for our graph Laplacian are generated using rapidly-exploring randomized trees (LaValle and Kuffner,
2000)  for  coverage.  Note  that  this  is  in  place  of  the  traditional  “behavioral  policy”  used  to  identify  the  proto-value 
functions;  it  provides  a  fast  and  efficient  coverage  of  our  continuous  state  space. 

For comparison, we also implemented the Markov Chain Policy Gradient algorithm (MCPG) (Algorithm 1 in
(Baxter and Bartlett, 2001)). For both methods the policy is parametrized as in Equation 6 using PVFs as the basis
set in the actor (i.e., the functions $\rho^j$). Figure 1 shows the moving mean of the average cost of the AC and MCPG
algorithms, each averaged over five trials. Each trial starts the pendulum from the initial condition, with the parameters
of the actor and critic initialized to small random values. We use a continuous setting in the AC experiment that consists
of 10,000 steps (updates). In the MCPG experiment, each trial consists of 100 episodes of length 100 steps. At the
beginning of each episode the pendulum is reset to the initial condition; the updates are accumulated until the end
of the episode, at which point the policy parameters are adjusted using the updates collected throughout the episode. The key
observation in this figure is that the AC method, using the policy structure that we described in Equation 6 together with
PVFs, converges smoothly to a local minimum. The comparison with MCPG supports the promise of AC algorithms
to outperform pure policy gradient methods. We also conducted an experiment where we used a polynomial basis set
(instead of PVFs), which quickly led to a degenerate case.

6  Discussion


Our  goal  in  this  work  has  been  to  provide  a  mechanism  for  designing  a  minimal  set  of  critic  features  which  satisfy  the 
conditions  required  for  the  convergence  proof  provided  by  Konda  and  Tsitsiklis  (2003).  It  should  be  noted  that  meeting 
these  conditions  in  a  strict  sense  may  not  be  necessary  for  obtaining  stable  and  efficient  actor-critic  algorithms — these 
conditions  were  used  to  facilitate  the  convergence  proofs.  Independent  of  the  convergence  proof,  a  linearly  independent 
basis  set  should  generally  permit  faster  learning  than  a  linearly  dependent  basis  set.  Additionally,  linear  independence 
from  the  function  1  takes  advantage  of  the  well-known  property  that  value  functions  can  be  offset  by  a  constant  value 
and  retain  the  same  gradient  information  for  the  policy;  there  is  no  policy  gradient  information  in  the  value  function 
along  the  direction  of  1.  Therefore,  the  qualities  we  have  investigated  appear  to  have  general  merit  for  representations 
used  in  value  learning. 

One  could  imagine  other  metrics  which  describe  good  critic  features  for  actor-critic  algorithms.  For  example,  the 
graph  Laplacians  attempt  to  capture  some  of  the  geometry  of  the  problem  in  a  sparse  representation.  One  could  also 
investigate  the  qualities  that  are  most  desirable  for  the  actor  representation  (besides  permitting  a  good  critic  feature 
set). Although omitting $\mathbf{1}$ from the graph Laplacian bases is attractive because the remaining bases satisfy the non-zero
projection property, when we choose to do this, we are certainly restricting the policy class. The inability to express
a constant bias in the policy may be an undesirable penalty for some problems. It is also worth noting that there are
other forms of the actor-critic algorithm which do not depend on a well-formed gradient spanning the full value
function to guarantee convergence (e.g., Kimura and Kobayashi (1998)).


7  Conclusions 

In  this  paper,  we  provide  some  insights  for  designing  features  for  actor-critic  algorithms  with  function  approximation. 
For a limited policy class, we demonstrate that a linearly independent feature set in the actor permits a linearly independent
feature set in the critic. This condition is satisfied by the piecewise linear barycentric interpolators, and by
the features based on a graph Laplacian. When combined with an additional linear independence with the function
$\mathbf{1}$, the critic features for any particular $\theta$ are uniformly bounded away from zero. This condition is satisfied by the
graph  Laplacian  features.  Finally,  our  experimental  results  demonstrate  that  our  proposed  representation  smoothly  and 
efficiently  converges  to  a  local  minimum  for  a  simulated  inverted  pendulum  control  task. 